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to the input variables. Smoothness of the network mapping is an important 
property in connection with the generalization performance of a network, as is 
discussed in greater detail in Section 9.2. The second reason is that the function 
g depends on the particular function y(x) which we wish to represent. This is 
the converse of the situation which we generally encounter with neural networks. 
Usually, we consider fixed activation functions, and then adjust the number of 
hidden units, and the values of the weights and biases, to give a sufficiently close 
representation of the desired mapping. In Kolmogorov's theorem the number of 
hidden units is fixed, while the activation functions depend on the mapping. In 
general, if we are trying to represent an arbitrary continuous function then we 
cannot hope to do this exactly with a finite number of fixed activation functions 
since the finite number of adjustable parameters represents a finite number of 
degrees of freedom, and a general continuous function has effectively infinitely 
many degrees of freedom. 

4.8 Error back-propagation 

So far in this chapter we have concentrated on the representational capabilities of 
multi-layer networks. We next consider how such a network can learn a suitable 
mapping from a given data set. As in previous chapters, learning will be based on 
the definition of a suitable error function, which is then minimized with respect 
to the weights and biases in the network. 

Consider first the case of networks of threshold units. The final layer of 
weights in the network can be regarded as a perceptron with inputs given by 
the outputs of the last layer of hidden units. These weights could therefore be 
chosen using the perceptron learning rule introduced in Chapter 3. Such an ap- 
proach cannot, however, be used to determine the weights in earlier layers of 
the network. Although such layers could in principle be regarded as being like 
single-layer perceptrons, we have no procedure for assigning target values to their 
outputs, and so the perceptron procedure cannot be applied. This is known as 
the credit assignment problem. If an output unit produces an incorrect response 
when the network is presented with an input vector we have no way of determin- 
ing which of the hidden units should be regarded as responsible for generating 
the error, so there is no way of determining which weights to adjust or by how 
much. 

The solution to this credit assignment problem is relatively simple. If we 
consider a network with differentiable activation functions, then the activations 
of the output units become differentiable functions of both the input variables, 
and of the weights and biases. If we define an error function, such as the sum-of- 
squares error introduced in Chapter 1, which is a differentiable function of the 
network outputs, then this error is itself a differentiable function of the weights. 
We can therefore evaluate the derivatives of the error with respect to the weights, 
and these derivatives can then be used to find weight values which minimize the 
error function, by using either gradient descent or one of the more powerful 
optimization methods discussed in Chapter 7. The algorithm for evaluating the 
derivatives of the error function is known as back-propagation since, as we shall 
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see, it corresponds to a propagation of errors backwards through the network. 
The technique of back-propagation was popularized in a paper by Rumelhart, 
Hinton and Williams (1986). However, similar ideas had been developed earlier 
by a number of researchers including Werbos (1974) and Parker (1985). 

It should be noted that the term back- propagation is used in the neural com- 
puting literature to mean a variety of different things. For instance, the multi- 
layer perceptron architecture is sometimes called a back- propagation network. 
The term back- propagation is also used to describe the training of a multi-layer 
perceptron using gradient descent applied to a sum-of-squares error function. In 
order to clarity the terminology it is useful to consider the nature of the training 
process more carefully Most training algorithms involve an iterative procedure 
for minimization of an error function, with adjustments to the weights being 
made in a sequence of steps. At each such step we can distinguish between 
two distinct stages. In the first stage, the derivatives of the error function with 
respect to the weights must be evaluated. As we shall see, the important con- 
tribution of the back-propagation technique is in providing a computationally 
efficient method for evaluating such derivatives. Since it is at this stage that 
errors are propagated backwards through the network, we shall use the term 
back-propagation specifically to describe the evaluation of derivatives. In the 
second stage, the derivatives are then used to compute the adjustments to be 
made to the weights. The simplest such technique, and the one originally con- 
sidered by Rumelhart et ai (1986), involves gradient descent. It is important to 

; recognize that the two stages are distinct. Thus, the first stage process, namely 
the propagation of errors backwards through the network in order to evaluate 
derivatives, can be applied to many other kinds of network and not just the 
multi-layer perceptron. It can also be applied to error functions other that just 

: the simple sum-of-squares, and to the evaluation of other derivatives such as the 
Jacobian and Hessian matrices, as we shall see later in this chapter. Similarly, the 

i second stage of weight adjustment using the calculated derivatives can be tack- 
led using a variety of optimization schemes (discussed at length in Chapter 7), 
many of which are substantially more powerful than simple gradient descent. 

4.8.1 Evaluation of error function derivatives 

We now derive the back- propagation algorithm for a general network having 
arbitrary feed-forward topology, and arbitrary differentiable non-linear activation 
functions, for the case of an arbitrary differentiable error function. The resulting 
formulae will then be illustrated using a simple layered network structure having 
a single layer of sigmoidal hidden units and a sum-of-squares error. 

In a general feed-forward network, each unit computes a weighted sum of its 
. inputs of the form 
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aj^^WjiZi (4.26) 

i 

where is the activation of a unit, or input, which sends a connection to unit 



h-> -jr.- v. . - 
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j y and wji is the weight associated with that connection. The summation runs 
over all units which send connections to unit j. In Section 4.1 we showed that 
biases can be included in this sum by introducing an extra unit, or input, with 
activation fixed at +1. We therefore do not need to deal with biases explicitly. 
The sum in (4.26) is transformed by a non-linear activation function g(-) to give 
the activation zj of unit j in the form 

(4.27) 

Note that one or more of the variables in the sum in (4.26) could be an input, 
in which case we shall denote it by x<. Similarly, the unit j in (4.27) could be an 
output unit, in which case we denote its activation by y*. 

As before, we shall seek to determine suitable values for the weights in the 
network by minimization of an appropriate error function. Here we shall consider 
error functions which can be written as a sum, over all patterns in the training 
set, of an error denned for each pattern separately 

E = J2 En (4.28) 

n 

where n labels the patterns. Nearly all error functions of practical interest take 
this form, for reasons which are explained in Chapter 6. We shall also suppose 
that the error E* can be expressed as a differentiable function of the network 
output variables so that 

E n =E n (y u ...,y c ). (4.29) 

Our goal is to find a procedure for evaluating the derivatives of the error function 
E with respect to the weights and biases in the network. Using (4.28) we can 
express these derivatives as sums over the training set patterns of the derivatives 
for each pattern separately. From now on we shall therefore consider one pattern 
at a time. 

For each pattern we shall suppose that we have supplied the corresponding 
input vector to the network and calculated the activations of all of the hidden 
and output units in the network by successive application of (4.26) and (4.27). 
This process is often called forward propagation since it can be regarded as a 
forward flow of information through the network. 

Now consider the evaluation of the derivative of E* 1 with respect to some 
weight Wji. The outputs of the various units will depend on the particular input 
pattern n. However, in order to keep the notation uncluttered, we shall omit 
the superscript n from the input and activation variables. First we note that 
E 71 depends on the weight Wji only via the summed input a,- to unit j. We can 
therefore apply the chain rule for partial derivatives to give 



£mor back-propagation 

dE n _ dff" da., 
dwji da, dwji 



We now introduce a useful notation 



6i = l^~ 
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(4.30) 



(4.31) 



where the 6>s are often referred to as errors for reasons we shall see shortly. Using 
(4.26) we can write 



da.j 

= z i- 

dwji 

Substituting (4.31) and (4.32) into (4.30) we then obtain 



dviji 



= 6jZi. 



(4.32) 



(4.33) 



Note that this has the same general form as obtained for single-layer networks 
in Section 3.4. Equation (4.33) tells us that the required derivative is obtained 
simply by multiplying the value of 6 for the unit at the output end of the weight 
by the value of z for the unit at the input end of the weight (where z - 1 in 
the case of a bias). Thus, in order to evaluate the derivatives, we need only to 
calculate the value of 6, for each hidden and output unit in the network, and 

then apply (4.33). . 
For the output units the evaluation of 6 k is straightforward. From the defam- 

tion (4.31) we have 



e dE n „ ,9E" 
6 k = ~ — = 9 (<**)- 



da k 



dyk 



(4.34) 



. site 



where we have used (4.27) with z k denoted by y k . In order to evaluate (4.34) we 
substitute appropriate expressions for g'(a) and dE"/dy. This will be illustrated 
with a simple example shortly. 

To evaluate the 6's for hidden units we again make use of the chain rule tor 

partial derivatives, 



5j = ^7 _ V 



da k da.j 



(4.35) 



twhere the sum runs over all units k to which unit j sends connections The 
fatrangenient of units and weights is illustrated in Figure 4.16. Note that the 
Site labelled k could include other hidden units and/or output units. In writing 
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8. 




Figure 4.16. Illustration of the calculation of 6j for hidden unit j by back- 
propagation of the 6*8 from those units k to which unit j sends connections. 

down (4.35) we are making use of the fact that variations in a, give rise to 
variations in the error function only through variations in the variables a*. If we 
now substitute the definition of 6 given by (4.31) into (4.35), and make use of 
(4.26) and (4.27), we obtain the following back-propagation formula 

*i = $'(aj)l>*i** (4.36) 
k 

which tells us that the value of 6 for a particular hidden unit can be obtained by 
propagating the 6's backwards from units higher up in the network, as illustrated 
in Figure 4.16. Since we already know the values of the £'s for the output units, 
it follows that by recursively applying (4.36) we can evaluate the j's for all of 
the hidden units in a feed-forward network, regardless of its topology. 

We can summarize the back-propagation procedure for evaluating the deriva- 
tives of the error E* 1 with respect to the weights in four steps: 

1. Apply an input vector x n to the network and forward propagate through 
the network using (4.26) and (4.27) to find the activations of all the hidden 
and output units. 

2. Evaluate the 6* for all the output units using (4.34). 

3. Back-propagate the 6's using (4.36) to obtain 6j for each hidden unit in 
the network. 

4. Use (4.33) to evaluate the required derivatives. 

The derivative of the total error E can then be obtained by repeating the above 
steps for each pattern in the training set, and then summing over all patterns: 



z 



f; 
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(4.37) 



In the above derivation we have implicitly assumed that each hidden or output 
unit in the network has the same activation function g( ). The derivation is 
easily generalized, however, to allow different units to have individual activation 
functions, simply by keeping track of which form of g(>) goes with which unit. 

4.8.2 A simple example 

The above derivation of the back-propagation procedure allowed for general 
forms for the error function, the activation functions and the network topol- 
ogy. In order to illustrate the application of this algorithm, we shall consider a 
particular example. This is chosen both for its simplicity and for its practical 
importance, since many applications of neural networks reported in the litera- 
ture make use of this type of network. Specifically, we shall consider a two-layer 
network of the form illustrated in Figure 4.1, together with a sum-of-squares 
error. The output units have linear activation functions while the hidden units 
have logistic sigmoid activation functions given by (4.10), and repeated here: 



9(a) ^ 



1 +exp(-a)' 



(4.38) 



A useful feature of this function is that its derivative can be expressed in a 
particularly simple form: 



g'(a)=g(a)(l-g(a)). 



(4.39) 



In a software implementation of the network algorithm, (4.39) represents a con- 
venient property since the derivative of the activation can be obtained efficiently 
from the activation itself using two arithmetic operations. 

For the standard sum-of-squares error function, the error for pattern n is 
given by 



(4.40) 



fc=i 



where yjt is the response of output unit fc, and t k is the corresponding target, for 
a particular input pattern x n . 

Using the expressions derived above for back- propagation in a general net- 
work, together with (4.39) and (4.40), we obtain the following results. For the 
output units, the 6 y s are given by 



6k=Vk- tk 



(4.41) 
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while for units in the hidden layer the 6's are found using 

c 

Sj = zj(l - Zj ) w kj 6 k (4.4 

where the sum runs over all output units. The derivatives with respect to tl 
first-layer and second-layer weights are then given by 

dE" . dE n r 

So far we have discussed the evaluation of the derivatives of the error functio 
with respect to the weights and biases in the network. In order to turn this int 
a learning algorithm we need some method for updating the weights based o 
these derivatives. In Chapter 7 we discuss several such parameter optimizatio 
strategies in some detail. For the moment, we consider the fixed-step gradien 
descent technique introduced in Section 3.4. We have the choice of updating th 
weights either after presentation of each pattern (on-line learning) or after firs 
s umming the derivatives over all the patterns in the training set (batch learning) 
In the former case the weights in the first layer are updated using 

Awji = -T)6jXi (4.44 
while in the case of batch learning the first-layer weights are updated using 

At0 ii = -f ? 5^^x? (4.45; 

n 

with analogous expressions for the second-layer weights. 
4.8.3 Efficiency of back-propagation 

One of the most important aspects of back-propagation is its computational 
efficiency. lb understand this, let us examine how the number of computer op- 
erations required to evaluate the derivatives of the error function scales with the 
size of the network. Let W be the total number of weights and biases. Then a 
single evaluation of the error function (for a given input pattern) would require 
0(W) operations, for sufficiently large W. This follows from the fact that, except 
for a network with very sparse connections, the number of weights is typically 
much greater than the number of units. Thus, the bulk of the computational 
effort in forward propagation is concerned with evaluating the sums in (4.26), 
with the evaluation of the activation functions representing a small overhead. 
Each term in the sum in (4.26) requires one multiplication and one addition, 
leading to an overall computational cost which is 0(W). 

For W weights in total there are W such derivatives to evaluate. If we simply 
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took the expression for the error function and wrote down explicit formulae for 
the derivatives and then evaluated them numerically by forward propagation, we 
would have to evaluate W such terms (one for each weight or bias) each requiring 
0(W) operations. Thus, the total computational effort required to evaluate all 
the derivatives would scale as 0{ W 2 ). By comparison, back- propagation allows 
the derivatives to be evaluated in 0(W) operations. This follows from the fact 
that both the forward and the backward propagation phases are 0(W), and the 
evaluation of the derivative using (4.33) also requires G{W) operations. Thus 
back-propagation has reduced the computational complexity from 0(W 2 ) to 
0{W) for each input vector. Since the training of MLP networks, even using 
back-propagation, can be very time consuming, this gain in efficiency is crucial. 
For a total of N training patterns, the number of computational steps required 
to evaluate the complete error function for the whole data set is TV times larger 
than for one pattern. 

The practical importance of the 0{W) scaling of back-propagation is anal- 
ogous in some respects to that of the fast Fourier transform (FFT) algorithm 
(Bingham, 1974; Press et a/., 1992) which reduces the computational complex- 
ity of evaluating an L-point Fourier transform from 0(L 2 ) to 0(Llog 2 L). The 
idiscovery of this algorithm led to the widespread use of Fourier transforms in a 
large range of practical applications. 

Hi 4.8.4 Numerical differentiation 

An alternative approach to back-propagation for computing the derivatives of 
||||« the error function is to use finite differences. This can be done by perturbing 
"3§pf^ each weight in turn, and approximating the derivatives by the expression 




dwji e { } 



(4.46) 



where e <C 1 is a small quantity. In a software simulation, the accuracy of the 
approximation to the derivatives can be improved by making e smaller, until 
numerical roundoff problems arise. The main problem with this approach is that 
the highly desirable G(W) scaling has been lost. Each forward propagation re- 
quires 0(W) steps, and there are W weights in the network each of which must 
jjbe perturbed individually, so that the overall scaling is G{ W 2 ). However, finite 
differences play an important role in practice, since a numerical comparison of 
,the derivatives calculated by back-propagation with those obtained using finite 
differences provides a very powerful check on the correctness of any software 
implementation of the back-propagation algorithm. 

The accuracy of the finite differences method can be improved significantly 
by using symmetrical central differences of the form 



dw 



= B»(v, Ji +e)-E»(w ji -e) + 2 
2e 



(4.47) 
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In this case the 0(c) corrections cancel, as is easily verified by Taylor expan- 
sion on the right-hand side of (4.47), and so the residual corrections are ©(e 2 ). 
The number of computational steps is, however, roughly doubled compared with 
(4.46). 

We have seen that the derivatives of an error function with respect to the 
weights in a network can be expressed efficiently through the relation 

dE dE , % 

dw^i = 35* < 448 > 

Instead of using the technique of central differences to evaluate the derivatives 
dE n /dwji directly, we can use it to estimate dE n /daj since 

g = ^(a J+ e)-^(a i - £ ) +0(e2) ^ 

We can then make use of (4.48) to evaluate the required derivatives. Because the 
derivatives with respect to the weights are found from (4.48) this approach is 
still relatively efficient. Back-propagation requires one forward and one backward 
propagation through the network* each taking 0(W) steps, in order to evaluate 
all of the dE/ddi. By comparison, (4.49) requires 2M forward propagations, 
where M is the number of hidden and output nodes. The overall scaling is there- 
fore proportional to MW, which is typically much less than the ©(W^ 2 ) scaling 
of (4.47), but more than the 0(W) scaling of back-propagation. This technique 
is called node perturbation (Jabri and Flower, 1991), and is closely related to the 
madeline III learning rule (Widrow and Lehr, 1990). 

In a software implementation, derivatives should be evaluated using back- 
propagation, since this gives the greatest accuracy and numerical efficiency. How- 
ever, the results should be compared with numerical differentiation using (4.47) 
tor a few test cases in order to check the correctness of the implementation. 

4.9 The Jacobian matrix 

We have seen how the derivatives of an error function with respect to the weights 
can be obtained by the propagation of errors backwards through the network. 
The technique of back-propagation can also be applied to the calculation of 
other derivatives. Here we consider the evaluation of the Jacobian matrix, whose 
elements are given by the derivatives of the network outputs with respect to the 
inputs 

A,.fg (4.50) 



where each such derivative is evaluated with all other inputs held fixed. Note 
that the term Jacobian matrix is also sometimes used to describe the derivatives 
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of the error function with respect to the network weights, as calculated earlier 
using back-propagation. The Jacobian matrix provides a measure of the local 
sensitivity of the outputs to changes in each of the input variables, and is useful 
in several contexts in the application of neural networks. For instance, if there 
are known errors associated with the input variables, then the Jacobian matrix 
allows these to be propagated through the trained network in order to estimate 
their contribution to the errors at the outputs. Thus, we have 

In general, the network mapping represented by a trained neural network will 
be non-linear, and so the elements of the Jacobian matrix will not be constants 
but will depend on the particular input vector used. Thus (4.51) is valid only for 

I small perturbations of the inputs, and the Jacobian itself must be re-evaluated 

• for each new input vector. 

f The Jacobian matrix can be evaluated using a back-propagation procedure 
I which is very similar to the one derived earlier for evaluating the derivatives of 
an error function with respect to the weights. We start by writing the element 
Jfci in the form 

ft 

/ = ^Mh- = V dy* dflj 

ki dxi Y daj dxi 




w 



dyk 



where we have made use of (4.26). The sum in (4.52) runs over all units j to 
Vhich the input unit i sends connections (for example, over all units in the first 
hidden layer in the layered topology considered earlier). We now write down a 
recursive back-propagation formula to determine the derivatives dyk /daj 

dyt _ y> dy k daj 
da; ~ da, da. 



dai daj 



where the sum runs over all units J to which unit j sends connections. Again, we 
have made use of (4.26) and (4.27). This back-propagation starts at the output 
units for which, using (4.27), we have 
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o£=y(°k)6kV (4.54) 

where Skk. is the Kronecker delta symbol, and equals 1 if * = jf and 0 otherwise. 
We can therefore summarize the procedure for evaluating the Jacobian matrix 
as, follows. Apply the input vector corresponding to the point in input space at 
which the Jacobian matrix is to be found, and forward propagate in the usual 
way to obtain the activations of all of the hidden and output units in the network. 
Next, for each row k of the Jacobian matrix, corresponding to the output unit Jfc, 
back-propagate using the recursive relation (4.53), starting with (4.54), for all of 
the hidden units in the network. Finally, use (4.52) to do the back-propagation 
to the inputs. The second and third steps are then repeated for each value of k y 
corresponding to each row of the Jacobian matrix. 

The Jacobian can also be evaluated using an alternative forward propagation 
formalism which can be derived in an analogous way to the back-propagation 
approach given here (Exercise 4.6). Again, the implementation of such algorithms 
can be checked by using numerical differentiation in the form 

^- ^"^-W )- (4.55) 



4.10 The Hessian matrix 

We have shown how the technique of back-propagation can be used to obtain the 
first derivatives of an error function with respect to the weights in the network. 
Back-propagation can also be used to evaluate the second derivatives of the error, 
given by 

&E 

(4.56) 



dwjidwik 



These derivatives form the elements of the Hessian matrix, which plays an im- 
portant role in many aspects of neural computing, including the following: 

1. Several non-linear optimization algorithms used for training neural net- 
works are based on considerations of the second-order properties of the 
error surface, which are controlled by the Hessian matrix (Chapter 7). 

2. The Hessian forms the basis of a fast procedure for re-training a feed- 
forward network foUowing a small change in the training data (Bishop, 
1991a). 

3. The inverse of the Hessian has been used to identify the least signifi- 
cant weights in a network as part of network 'pruning' algorithms (Sec- 
tion 9.5.3). 

4. The inverse of the Hessian can also be used to assign error bars to the 
predictions made by a trained network (Section 10.2). 
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Abstract— \ number of steepest descent algorithms have been 
developed for adapting discrete-time dynamical systems, includ- 
ing the backpropagation through time and recursive backprop- 
agation algorithms. In this paper, a tutorial on the use of these 
algorithms for adapting neural network controllers and filters 
is presented. In order to effectively compare and contrast the 
algorithms, a unified framework for the algorithms is developed 
This framework is based upon a standard representation of 
a discrete-time dynamical system. Using this framework, the 
computational and storage requirements of the algorithms are 
derived. These requirements are used to select the appropriate 
algorithm for training a neural network controller or filter. 
Finally, to illustrate the usefulness of the techniques presented 
in this paper, a neural network control example and a neural 
network filtering example are presented. 

I. INTRODUCTION 

THE ABILITY of humans to control and interact with a 
complex environment has motivated our study of adaptive 
discrete-time dynamical systems which are fully or at least 
partially composed of neural networks. Humans regularly per- 
form certain tasks easily and proficiently which have proven 
difficult to reproduce with machines. Examples of such tasks 
include driving a car, recognizing the differences between a 
cat and a dog, and understanding spoken language. Humans 
learn how to accomplish these tasks in part by interacting with, 
manipulating and eventually controlling their environment. In 
order to build machines which accomplish these difficult tasks, 
it may be necessary both to mimic humans learning through 
environmental interaction and to model the low level functions 
of the human nervous system. The algorithms presented in this 
paper provide one possible technique for accommodating both 
of these requirements. 

Although the human ability to control its environment has 
motivated us, our primary interest lies in the development of 
engineering tools for adaptive nonlinear control and filtering. 
Specifically, we have investigated adaptive algorithms that 
allow the design of closed-loop nonlinear controllers and 
nonlinear infinite impulse response (IIR) filters. Sigmoidal 
neural networks of the type described by Rumelhart [1] are 
used to implement the controllers and filters. Neural networks 
of this type were selected because they are capable of learning 
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a nonlinear function to any arbitrary degree of accuracy [2], 
[3]. 

A. A Unified Framework 

The algorithms presented in this paper are based on steepest 
descent methods [4], [5] for calculating changes to adaptive 
parameters of a discrete-time dynamical system. Most of 
the algorithms presented in this paper have been previously 
published. Our goal in this paper is to present these algorithms 
in a unified framework so that their application to neural 
network control and filtering can be easily understood. 

It is our hope that this unified framework will prove useful 
to those who are both familiar and unfamiliar with steepest 
descent approaches to training discrete-time dynamical sys- 
tems. For those who are not familiar with these algorithms, 
we hope that the unified treatment of the theory makes it 
easier to understand the major differences between the various 
algorithms. For those who are familiar with these algorithms, 
we hope that the unified framework provides insight into some 
of the more subtle differences. 

In this paper, we divide the theory concerning the training 
of adaptive discrete-time dynamical systems into three distinct 
layers. The three layers include the mathematical theory, 
general techniques for training dynamical systems, and specific 
algorithms for training dynamical systems composed of neural 
networks. The three layers are hierarchically structured with 
the mathematical theory at the top and the specific algorithms 
at the bottom. As shown in this paper, by starting at the top 
layer and working down, a natural unified framework for the 
steepest descent algorithms emerges. 

The top layer includes the mathematical theory concerning 
the calculation of an error gradient, which is used to drive 
the steepest descent algorithms. This theory, which was de- 
veloped by Paul Werbos and is well-developed in his thesis 
[6], provides a formal technique for calculating gradients of 
discrete-time dynamical systems. As shown in Section III, his 
theory is based upon the ordered partial derivative [6]. 

The middle layer of the theory is composed of two general 
algorithms for updating the weights of an adaptive discrete- 
time dynamical system. 1 As shown in this paper, the two 
techniques, which are referred to as the backsweep and re- 
cursive algorithms, result from the use of different chain rule 

'As shown in Section V. the general algorithms must be divided into 
cpochwise or on-line versions (epochwise and on-line refer to the mode 
of operation of the dynamical system): therefore, there are actually four 
general algorithms. In this introduction section, it is convenient to ignore 
The epochwise and on-line difference, and treat the four general algorithms as 
if they were only two. 
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expansions of the ordered partial derivative. Both techniques 
have their origins in the 1960s. The backsweep technique for 
calculating the gradient of a discrete-time dynamical system 
with a single tap-delayed output was developed by Bryson 
[7]. This technique has been extensively applied in the field 
of optimal control. In this paper, we present the multiple- 
delayed version of this algorithm. The recursive technique for 
adapting dynamical systems was developed independently by 
Narendra and McBride [8] [9], and Kokotovic [10]. (A paper 
by Narendra and Parthasarathy [11] reviews the relationship 
between the work done in the 1960's and recent research on 
neural networks.) Because the recursive algorithm is well- 
suited for real-time learning, it's earliest applications were in 
the area of adaptive control. 

The lowest layer of the theory includes the specific al- 
gorithms for training different types of dynamical systems. 
This layer includes the specific algorithms used to train neural 
network controllers and filters. An example of one of the 
algorithms which compose the lowest level of the theory is 
the backpropagation through time algorithm [1] [12]— [18]. 
This algorithm consists of using the general backsweep tech- 
nique to train a specific architecture, a dynamical system 
composed solely of a feedforward neural network. 2 As shown 
in this paper, the backpropagation through time algorithm is 
characterized by low computational cost and high storage 
requirements. Furthermore, it is not an inherently on-line 
algorithm. The algorithm was popularized by its description in 
Chapter 8 of the well-known PDP book [1]. It was further pop- 
ularized by its use for several control applications [ 12]— [14]. A 
number of good papers on backpropagation through time have 
been published over the past few years, including [19]— [28]. 
Most of the applications of this algorithm have been in the 
fields of control and speech processing. Because not all the 
details of the backpropagation through time algorithm are 
covered in this paper, a reader interested in further insights 
should refer to the referenced papers. 

The use of the recursive technique for training dynamical 
systems composed of neural networks, which is referred to as 
the recursive backpropagation algorithm, is another example 
of an algorithm which is a member of the lowest level of the 
theory. As shown in this paper, the recursive backpropagation 
algorithm [15], [16], [29]-[32], which is often referred to as 
the real-time learning algorithm, is characterized by high com- 
putational cost and low storage requirements. Furthermore, it is 
an inherently on-line learning algorithm which is well suited 
for real-time learning. The algorithm was initially proposed 
by a number of different researchers [15], [16], [29]-[32]. 
Important applications of the algorithm have occurred in the 
areas of control [11], [33], [34], and speech processing [15], 
[16], [32], [35]. 

A review of many of the fundamental papers on the back- 
propagation through time and recursive backpropagation al- 
gorithms is given in [361. Finally, in [37], the relationship 

- It is important to note that the backsweep technique and backpropagation 
through time algorithm are not one and the same. Instead, backpropagation 
through time is (he application of the backsweep technique to a specific 
dynamical system architecture, the feedforward neural network. 



between the dynamical neural network algorithms is shown 
using graph theory. 

B. Potential Drawbacks of the Algorithms 

Before proceeding with the development of the algorithms, 
two important drawbacks to training dynamical systems must 
be pointed out. First, for discrete-time dynamical systems 
trained using steepest descent algorithms, the weights (adap- 
tive parameters) are not guaranteed to converge to values 
which would result in the global minimum of the error 
function. Neural network researchers should not find this 
surprising because training algorithms for static feedforward 
neural network have the same drawback. However, it should 
be noted that, in our experience, convergence to a non- 
acceptable local minimum is a more significant problem when 
training dynamical systems than static systems. Second, the 
stability of the weight updates cannot be guaranteed because 
analytic stability conditions for the training algorithms have 
not yet been developed. Stability is also a problem when 
training feedforward neural networks, but once again, in 
our experience, stability is a more significant problem when 
training dynamical networks. 

These two drawbacks can be overcome by off-line develop- 
ment of the adaptive dynamical system. In such cases, if either 
non-optimal performance or instability occurs, the system 
can simply be retrained. If real-time applications of these 
algorithms are desired, conservative values of the training 
parameters should be selected to partially overcome these two 
problems. In such cases, the cost of non-optimal performance 
or instability should be small. 

Finally, it should be noted that in some cases it is possible to 
short-circuit the algorithms presented in this paper by feeding 
back the desired output response instead of the actual output. 
In this case, the feedback loop is broken and simpler, static 
algorithms can be used to train the adaptive dynamical system. 
This method of training dynamical systems is referred to as 
teacher forcing [29], [38], [39] and generally is much more 
computationally efficient than the algorithms presented in this 
paper. Hence, if the desired output responses are available 
at each iteration, it is probably best to try teacher forcing 
methods before resorting to the algorithms present in this 
paper. However, for many control and filtering applications, 
such as the two given in Section VII of this paper, the desired 
output responses are not available at each iteration and the 
computationally more expensive dynamical algorithms must 
be used. 

C. Outline 

This paper is composed of eight sections and two ap- 
pendices. Section II introduces a standard representation for 
a discrete-time dynamical system and illustrates how this 
representation can be used to model neural network controllers 
and filters. The mathematical theory concerning discrete-time 
dynamical systems is presented in Section III. Section IV 
presents the differences between epochwise and on-line train- 
ing. The general gradient calculation techniques along with the 
specific algorithms for adapting dynamical neural networks 
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/r(R k ,Y k ,W k G>) 




Fig. I. Standard representation. 

are given in Section V. Detailed derivations of the general 
algorithms are contained in the Appendices A and B. A 
comparison of the computational and storage requirements 
of the dynamical neural network algorithms is included in 
Section VI. This section also includes a discussion of the 
appropriate algorithms to use for neural network control and 
filtering applications. Section VII presents two applications of 
the algorithms, and Section VIII provides a conclusion. 

II. Definitions 
In this section, in order to provide a unified framework, a 
standard representation of a discrete-time dynamical system is 
proposed. It is shown that any discrete-time dynamical system, 
such as a neural network control system or IIR filter, can be 
expressed in this representation. 

A. The Standard Representation 

Fig. 1 shows a standard representation of a discrete-time 
dynamical system. This representation was introduced by 
Narendra [33]. In this figure, A: denotes the iteration of 
the discrete-time dynamical system (fc = 0 represents the 
first iteration). The input of the system is composed of two 
components. The first component, R fct is made up of an 
external input vector r k and J delayed versions of this vector, 
r*-i, . . . , r k -j. Each of the external input vectors is a length 
M colurnn vector (r* € R^ ix % therefore, K k may be 
defined as a length (J + 1)M column vector of the form 
Rfc = [r? i r fc-n • • ■ • r L jl r - The second component, Y fcf is 
composed of the previous L output vectors. The output vector 
at iteration & is a length N column vector (y k e R [ x ] )- 
Therefore, Y fc is a length LN column vector of the form 

The weight vector W(i), which contains all adaptive pa- 
rameters of the dynamical system, is defined as a length Q 
column vector (W(i) € Rl Qxl] ). The index i denotes the 
number of times the weight vector has been updated using a 
steepest descent rule. The initial weight vector W(0) is usually 
produced by a random number generator. We use W fc (z) to 
denote the use of W(i) at iteration k of the dynamical system. 
Finally, let w k {i) denote any weight of the vector W fc (i). 



Fig. 2. neural network controller and plant in the standard representation. 

By including all feedback states of a discrete-time dynami- 
cal system in the output vector y fc , any dynamical system can 
be written as 

y fc = /(R fc ,Y fc ,W fc (i)), 0) 

where the function /(•) contains no delays [33]. The steepest 
descent algorithms of Section V are all defined in terms of 
this standard representation. 

fi. Standard Representations of Control Systems and Filters 

In this section, in order to illustrate the flexibility of the 
standard representation, we present two discrete-time dynam- 
ical systems in their standard representations. Figure 2 shows 
how the standard representation may be used to model a 
neural network based control system. This dynamical system 
is composed of two basic components: a neural network 
controller and a plant model. 3 These two components are 
generically represented by /(R fcl Y fc> W fc (i)) in the standard 
representation. 

As shown in Fig. 2, the neural network control system 
operates in the following manner: An external reference signal, 
r fc , is fed directly into a neural network controller. Given 
this reference signal along with the state of the plant model, 
which is assumed to be contained in y fc _i, the neural network 
controller computes the control signal u k . The output of the 
plant model, y fc , is calculated using the control signal, u ky and 
the state of the model, y fc _i. It should be pointed out that it is 
assumed that all states of the discrete-time plant model which 
are contained in the plant's feedback loop are included in the 
output vector y fc . Under this assumption, it can be observed 
that the neural network control system of Fig. 2 is given in 
its standard representation. 

Before proceeding to the next example of a system in its 
standard representation, it is worth discussing the plant model 
of Fig. 2 in more detail. This plant model may take two 
different forms. The first and most general form is simply a 
set of equations which map previous state and current control 
effort to next state: y fc = / p (u fc ,y fc -i) [40]-[42]. If this set 
of equations is nonlinear, training the neural network of Fig. 
2 using a steepest descent algorithm results in a nonlinear 
state feedback controller. Thus, the steepest descent algorithms 
presented in this paper can be used as a general design 
techniques for developing nonlinear controllers. 

The second form of a plant model is a feedforward neural 
network [12], [33], [43]. In this case, the plant model takes 
3 When the system of Fig. 2 is used for real-world control, the plant model 
is replaced by the actual plant. 
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f(R„Y„W k (i)) 




Fig. 3. neural network filter in the standard representation. 

the form y k = / P (u fc ,y fc -.i, W*), where W is the weight 
vector of the neural network. This weight vector is derived by 
training the neural network on plant input-output data using 
the backpropagation algorithm [1], [44], [45]. Once trained, 
this model can be updated on-line using additional plant input- 
output data. Because the steepest descent algorithms presented 
in Section V use the plant model to update the controller, a 
neural network controller trained in this manner can adapt 
to slow changes in a plant. Therefore, the steepest descent 
algorithms can be used for nonlinear adaptive control. 

Figure 3 shows an example of a single-input, single-output, 
nonlinear IIR filter in the standard representation. In this 
example, the external filter input, r fc , along with a delayed 
version of this signal, r^-i, form one component of the input 
to a neural network. Delayed versions of the output, y*-i 
and yfc-2» form the other component of the input. The neural 
network in this example is assumed to have a sigmoidal hidden 
layer and a linear output unit. 

III. Ordered Partial Derivatives 

To update the weights using steepest descent, a partial 
derivative of the associated dynamical system must be cal- 
culated. Because a dynamical system contains feedback, the 
calculation of this derivative can be quite complex. The 
ordered partial derivative, which is a partial derivative whose 
constant and varying terms are defined using an ordered set 
of equations, provides a mathematical tool for easily finding 
derivatives of complex dynamical systems [6]. 

To define the ordered derivative, the concept of an 
ordered set of equations must first be introduced. Let 

be a set of n variables whose 
values are determined by a set of n equations. This set of 
equations is defined to be an ordered set of equations if each 
variable Z{ is a function only of the variables {z\, . . . , 
Thus, the equation for any variable of an ordered set of 
equations can be written as 



Because of the ordered nature of this set of equations, the 
variables {21, . . . , Zi-i} must be calculated before 2; can be 
computed. As an example, the following three equations form 
an ordered set of equations: 



zi = 1 

Z 2 = 32 X 

2 3 = Z\ + 22 2 . 



(2) 

(3) 
(4) 



When calculating a partial derivative it is necessary to 
specify which variables are held constant and which are 
allowed to vary. Typically, if this is not specified, it is 
assumed that all variables are held constant except those 
terms appearing in the denominator of the partial derivative. 
This is the convention adopted in this paper; thus, the partial 
derivative of 23 with respect to z\ , dzz/dz\ y is 1. 

An ordered partial derivative is a partial derivative whose 
constant and varying terms are determined using an ordered 
set of equations. The constant terms of the ordered partial 
derivative of zj with respect to 2,, which is denoted d+zj/dzj 
in order to distinguish it from an ordinary partial derivative, are 
{2 1? ...,2i_i}. The varying terms are {2;. . . . . zj, . . . . z n }. 
Using mathematical notation, the ordered partial derivative is 
defined as 



d+Zj _ dzj 
dzi " dzi 



{21 .... . 2i_x }held constant. 



Using this definition, the following two properties of the 
ordered derivative can easily be shown 



d+zi 



dz{ 



+1 



dzi 



dz { 



and 



— = 0 if j < 2. 

OZi 

When j > i + l f the ordered derivative is found using either 
of the following two chain rule expansions: 

d+zj _ dzj ^ d+zj dz k 

dzi *" dzi dz k d^' 



and 



dzj J y, dzj d+z k 
dz { ~ dzi + , ^rr, dz k dzi ' 



(6) 



As an example, using either the first or second chain rule ex- 
pansions and the ordered set of equations given in (2H4), the 
ordered partial derivative of 23 with respect to 21, d + zz/dz { , 
is 7. 

Finally, one comment on mathematical notation, throughout 
this paper it is assumed that a partial derivative of the form 
da/db, where a e F$ Axl] and b € fi (Bxl1 , is a matrix of 
the form R^ AxB l 



IV. Epochwise and On-Line Training 

Steepest descent algorithms adapt the weights of a discrete- 
time dynamical system by minimizing an error function. The 
definition of this error function is dependent upon whether the 
system is being trained in an epochwise or on-line mode. In 
this section, both epochwise and on-line training algorithms 
are defined. The epochwise and on-line error functions as well 
as their associated weight update equations are also presented: 
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A. Epochwise Training Algorithms 

An epoch is an iteration to iteration cycling of a discrete- 
time dynamical system from initial to final iteration (k - k f ). 
An epochwise training algorithm is any algorithm in which 
training takes place after each epoch or after a series of epochs. 

Epochwise systems are encountered much more fre- 
quently in control applications than in filtering applications. 
(Epochwise controllers are often referred to as terminal 
controllers.) In control applications, the error function used 
to train epochwise controllers is determined by the type of 
control desired. If the controller is required only to drive the 
plant to a desired final state d fc/ , then the error function 



This can usually be accomplished whenever a desire response 
in some form is accessible [38]. 

In both on-line controller and filtering applications, an error 
is usually defined at each iteration. At the forward iteration 
k', the error E k - is usually a function of the desired response 
vector d fc . and the output vector yjt-- For on-line applications, 
it is common to use the error function 



•yfc') T (dfc' -yfc')- 



(10) 



E = \{d kf -y k ,) T {d k ,-y kl ) 



(7) 



may be used. However, if the plant is to follow a trajectory 
given by a set of desired states, {d 0) . . . ,d t/ }, then the error 
function 



£ = ^i(d fc -y fc ) T (cl fc -y fc ) 



(8) 



fc=0 



can be used. The error function for filtering applications is 
also determined by the required performance. Most often in 
filtering applications the output is to track a desired response 
at each iteration. In such cases, the error function of (8) may 
be used to adapt the weights. 

Utilizing steepest descent, epochwise algorithms update the 
weights using 

^ + D = M0-/^) (9) 

where /x, the learning rate, is a suitably chosen positive 
constant. Notice that the ordered partial derivative of the error 
with respect to the weights is used to update the weights. The 
update rule generates a new weight vector W(i + 1) from 
the vector W(i). If either of the error functions defined m 
(7) or (8) are used, the weights are updated after each epoch. 
Therefore, if weight vector W(0) is used for the first epoch, 
weight vector W(i - 1) is used for the tth epoch. 

B. On-Line Training Algorithms 

If the weight update of an algorithm at the current iteration 
k' depends only on the states of the system at iterations 
{0, . . . , - 1, fc'}* lhen tne algorithm is defined to be an on- 
line training algorithm. The implied dependence only upon 
the current and past values of the system often allows the 
weight updates to be computed in real-time. The key difference 
between on-line and epochwise training algorithms is that 
an on-line algorithm adapts the weights of the system as it 
runs while an epochwise training algorithm only updates the 
weights after the final iteration. 

On-line training algorithms can be used in both adaptive 
control and filtering applications. In control applications, on- 
line training allows a controller to either track slow changes 
in plant dynamics or adapt to unknown plant characteristics. 
In adaptive filtering applications, on-line training can be used 
to develop nonlinear filters in real-time as the system runs. 



Using this error function, it is possible and usually preferable 
to update the weights at each iteration. Because of the update 
at each iteration, calculation of an exact error gradient is not 
possible. Instead, an approximation of the error gradient must 
be used to update the weights. Therefore, the on-line update 
rule at iteration k' is expressed as 

where p. is a suitably . chosen positive constant and 
d+E k -/dw(k') is the approximate error gradient. 

V. ALGORITHMS 
In this section the general algorithms for adapting discrete- 
time dynamical systems, both backsweep and recursive, are 
described. Because dynamical systems may operate in either 
epochwise or on-line modes, two versions of each of these 
algorithms are presented. In addition, specific algorithms for 
adapting discrete-time dynamical systems composed solely 
of neural networks, which we shall refer to as dynamical 
neural networks, are also described. These algorithms may 
be used directly to train neural network controllers and filters. 
Finally, in order to select the appropriate algorithm for training 
a dynamical neural network, it is necessary to understand 
the computational and storage requirements of the various 
methods. Therefore, an analysis of these requirements for each 
of the dynamical neural network algorithms is included. 

The differences between the backsweep and recursive al- 
gorithms can be traced to the differences in the derivations 
of the epochwise algorithms. (The on-line versions of both 
algorithms are easily derived from the epochwise versions. 
The detailed derivations of the epochwise algorithms are 
included m Appendices A and B.) As shown in Appendix A. 
computing the error derivative, d + E/dw(i), using repeated 
application of the first chain rule expansion, (5) results in 
the epochwise backsweep algorithm. Calculating d + E/dw{i) 
using the second chain rule expansion, (6), leads to the 
epochwise recursive algorithm. Even though both types of 
algorithms perform steepest descent, the characteristics of the 
two classes of algorithms are quite different. 

As is shown later in this section, the backpropagation 
algorithm is used in combination with either the backsweep or 
recursive algorithms when training dynamical neural networks. 
Therefore, an understanding of the computational requirements 
of the backpropagation algorithm is necessary in order to 
derive the complexity of the dynamical neural network algo- 
rithms. The backpropagation algorithm consists of a forward 
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propagation through a neural network, a backward propaga- 
tion, and a weight update for each training epoch [1]. Each of 
these computations requires on the order of Q multiplications 
and additions, where Q is the number of weights in the 
network. Therefore, the computational complexity of a weight 
update using the backpropagation algorithm is on the order 
of SQ operations. (The term operation shall refer to one 
multiplication and addition.) 

A. The Epochwise Backsweep Algorithm 

The epochwise backsweep algorithm is derived in full detail 
in Appendix A. The algorithm, its implementation, and its 
computational and storage requirements are presented in this 
section. The algorithm is implemented by performing the 
following at each epoch: 

1) Create an Epoch. Given the initial conditions Ro and 
Y 0 , the weight vector W(i) and the set of external 
inputs {ri, r 2 , . . . , r kf }, cycle the two equations 



W fc (i) = W(i), 

y fc = /(R*,Y fcl W k (i)) 



(12) 



starting from the initial iteration (k = 0) to the final 
iteration (k — kf). This process forms an epoch of the 
discrete-time dynamical system. 
2) Calculate the Lagrange multipliers. The Lagrange mul- 
tipliers, which are defined as 



A fc = 



d+E 
dy k ' 



are used to simplify the calculation of the error gradi- 
ent. Given an epochwise error function, the Lagrange 
multipliers are computed by cycling 



9E A . 



dy k 



backwards from iteration fc = fc/-ltoA; = 0. The 
following two equations are used to establish the initial 
conditions for (13): 



dE 
dyk/ 



Ajkj+i,..., A fc/ +£, =0. 

3) Compute the Error Derivative. Given the Lagrange mul- 
tipliers, the ordered partial derivative of the error with 
respect to a weight is calculated using 



8+E 
dw(i) 



fc=0 



dy k 

dw k (i)' 



(14) 



The term dy k /dw k (i) is the derivative of the output at 
iteration k with respect to one of the weights used in the 
dynamical system at iteration k. 
4) Update the weights. The weights are updated using the 
epochwise steepest descent rule (9). 



The epochwise backsweep algorithm can be used to train 
any discrete-time dynamical system provided that the deriva- 
tives of the output y fc with respect to the weights W(z) and 
recurrent inputs Y k exist. For any general system, the term 
dE/dy k of (13) is found by differentiating the error function. 
The terms dy k +j/dy k of (13) and dy k /dw k (i) of (14) can 
both be calculated by differentiating the output, which is given 
by (12). 

I) Backpropagation Through Time: If the function 
/(R*, Yjb, W(i)) is implemented by a feedforward neural 
network, as is the case for the IIR neural filter of Fig. 3 or 
the neural controller of Fig. 2 with a plant model represented 
by a neural network, it is possible to use a computationally 
efficient method to calculate the Lagrange multipliers of 
(13) and the error gradients of (14). This method, which is 
referred to as the backpropagation through time algorithm 
[I]. [I2HI8], is an example of the lowest layer of the theory 
on training adaptive dynamical systems. 

The efficiency of this method is achieved by using the 
backpropagation algorithm to compute partial deviatives which 
are of the form 



dE _ dEdy_ _ ^dy 
dw dy dw dw 



(15) 



Whenever the vector y of (15) represents the output of a feed- 
forward neural network, the error derivative can be efficiently 
computed by "backpropagating" the vector A through the feed- 
forward neural network [1]. In the dynamical neural network 
case, since is the output of a feedforward network with 
weights Wfc(t) and inputs R k and Yjt, the backpropagation 
algorithm can be used in combination with the backsweep 
technique to efficiently compute the weight updates. By back- 
propagating the vector A* through the feedforward network 
of iteration k y the product \ k dy k /dw k (i) of (14) can be 
efficiently computed. In addition, by backpropagating the 
vector Ajt to the recurrent inputs of this netwQrk, the set 



{A* 



dy k 
dy k -x 



dyk 
dy 



k-L 



■h 



which is used in (13), can be efficiently computed. 

To minimize the number of backpropagations required to 
calculate the epochwise error derivative, in the calculation 
of the vector \ k in Step 2, the vector \k+\ should be 
backpropagated all the way through the neural network of 
iteration k + 1, and the results should be stored in memory. 
Using this approach only kj + l backpropagations are required. 

If the function f(R k , Y fc , W(i)) is implemented only par- 
tially by a neural network, as is the case if the plant model 
of Fig. 2 is implemented by a set of equations, the back- 
propagation algorithm can still be used to reduce the com- 
putational requirements of calculating the error derivative. In 
such systems, vector-derivative products associated with the 
feedforward neural network portion must be computed at each 
iteration as part of the calculation of the terms of the sums 
of (13) and (14). By using the backpropagation algorithm 
for these calculations, the computational requirements can 
be minimized. Thus, a combination of the backsweep and 
backpropagation algorithms can also be used to train a neural 
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network controller when the plant model is implemented by 
a set of equations. 

2) Computational and Storage Requirements of Backpropa- 
Jation Through Time: In this section the computational and 
ftoTe requuLents of training dynamical neural networks us- 
ng L backpropagation through time algorithm arc discussed 
Sis algorithm is based upon k f + 1 forward and backward 
propagations through the dynamical neural network (Steps 1 
Ed2) for each epoch. Because each of these propagates 
requires on the order of Q multiplications and additions, where 
Q is the number of weights, 2(fc, + l)Q operations are needed 
to calculate the fc/ partial error derivatives of the right hand 
side of (14). Summing these derivatives and using .the result 
to update the weights (9) requires approximately (fc/ + l)Q 
additional operations. The total number of multiplications and 
additions for each weight update using the backpropagat.on 
through time algorithm is 

Cbb « 3(fc/ + l)Q- 
The storage requirement of the algorithm is composed 
of two different components. First, the weights and their 
associated error derivatives need to be stored in memory 
for efficient computation. These terms require 2Q memory 
locations. Secondly, the output vector y fc and vernal input 
vector r fc must be stored at each iteration of the forward 
propagation for use in the calculation of the error derivatives. 
Storing only the input and output vectors of each iteration 
minimizes the storage requirement while increasing the com- 
Ztion requirements by at most (fc/ + l)Q. This increase 
occurs because the internal nodes of the neural network must 
be recalculated in the backward sweep if these values are not 
^red. The external input r t is a vector of length M whj 
me output y fe is a vector of length N. Thus. (fc/ + 1)(M 4- N) 
memory locations are required for the external inputs and 
outputs. Adding the two components together, the minimal 
storage requirement of the algorithm is 

S EB *(k f + l)(M + N)+2Q. 



B. The On-Line Backsweep Algorithm 

The on-line version of the backsweep algorithm is based 
upon the concept of introducing artificial epochs into the 
system and then using the epochwise algorithm to update the 
weights. An artificial epoch is created by assuming that the 
current iteration k' is the final iteration of an artificial epoch. 
The first iteration of the artificial epoch is assumed to be fc - i . 
where T is a positive integer constant. If k' - T < 0. then the 
first iteration is k = 0. 

Calculating the error derivative using an artificial epoch of 
length T + 1 leads to an approximation of the true derivative. 
Thus, in the on-line case the weights must be updated using 
the following approximate error gradient 
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be noted that because of the weight updates at each iteration 
the terms dy k /dw k {i) generally becomes less accurate as fc 
is decreased from k' . To reduce these greater inaccuracies, i 
is possible to use exponential weighting (See Step 3 below) 
[46]. The on-line backsweep algorithm is implemented by 
performing the following at each iteration: 

1) Iterate the dynamical system. At the current iteration J: . 
given the input vector R fc <, feedback output vector Y fc . 
and weight vector W fc - (fc'). iterate the dynamical system 
once using 



d + E k 
dw(k') 



k' 



fc=fc'-T 



dy k 

dw k (iY 



where y fc and A fc are the approximate outputs and Lagrange 
multipliers respectively. (These are defined below.) It should 



y fc , = /(R fc .,Y fc .,W fc .(fc'))- 

The weight vector W fc .(fc') is an updated version of the 
weights used at the previous iteration. 
2) Calculate the approximate Lagrange multipliers The 
Lagrange multipliers cannot be calculated precisely be- 
cause of the weight changes at each iteration. An ap- 
proximation to the Lagrange multipliers can be found 
by backward iterating the following equation: 

k -*~ l k '- T - 

' =1 (16) 

where 

d+E k . 
~dy~k~ 

The approximate output y fc is defined as 

y fc «y fc = /(R fcl Y fc ,W fc (/0)- (17) 

In (17) y fc . which is calculated using the weight vector 
W fc (fc) is approximated by y fc . which is computed 
using W fc (fc'). Obviously, the smaller the difference 
between the weights at iteration fc and those at iteration 
fc' the more accurate this approximation will be and 
thus the more accurate the calculation of the Lagrange 
multipliers will be in (16). In the on-line case, it is 
common to use the mean squared error as the error 
function (10). In such cases, the boundary conditions 
of (16) take the form 

A fc , = -(dfc' -y*') 

Afe< + i Afc'+i, =0 

dE k < dE k > dE *' _ o. 

dyfc'-r' dyv-T+i ' " " 5 y<='-i 
3) Calculate the approximate error derivative. The approx- 
imate error derivative, with exponential weighting, is 
calculated using 

fc' 



d+E k 



y a k '~ k Xk 



dyk 



(18) 



dw{k>) k ^_ T h dw k {i)' 

where the constant 0.0 < a < 1-0 is the exponential 
weighting coefficient. The weighting causes the less 
precise terms of (18) to contribute less to the overall 
error derivative. 
4) Update the weights. The weights are adap ed at each 
iteration using the on-line update rule of (1 1 )• 
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One difficulty associated with the on-line backsweep algo- 
rithm is the selection of appropriate values for the constants T 
and a. Like the selection of the learning rate, there are no 
analytic procedures for choosing these constants. Instead, the 
selection should be based upon knowledge of the dynamical 
system, desired convergence rate and required misadjustment 
upon convergence. 

1 ) On-Line Backpropagation Through Time: Once again, 
the on-line algorithm can be used to train any discrete-time 
dynamical system whose output y*' is differentiate with 
respect to the weights W(fc') and recurrent inputs Y k >. 
Thus, the algorithm can be used for on-line adaptation of 
neural network IIR filters and controllers. If the function 
/(ft*, Yjt,W(i)) is implemented by a feedforward neural 
network, a combination of the backpropagation and on-line 
backsweep algorithms, which is referred to as on-line (or 
truncated) backpropagation through time [27], [47], can be 
used to update the weights. Recall that in the epochwise 
version, when a dynamical neural network is trained, it is 
necessary to backpropagate the vector Afc + i through the 
feedforward neural network with output yjt+i in order to 
calculate Ajb in Step 2. In the on-line version, the vector A*+i 
must also be backpropagated through the neural network 
with output yjt+i in order to compute \ k . However, before 
this backpropagation can occur, a forward propagation of 
R*+i and Y*+i through the feedforward neural network with 
weight vector W(fc') must be performed in order to establish 
the appropriate conditions for the backpropagation. Therefore, 
at each iteration of Step 2 of the on-line backpropagation 
through time algorithm, y*+i is first computed by a forward 
propagation, A*+i is then backpropagated and finally A* is 
calculated using (16). 

2) Computational and Storage Requirements: The compu- 
tational complexity and storage requirements of the on-line 
backpropagation through time algorithm for a dynamical neu- 
ral network with Q weights are easily derived from those 
of the epochwise algorithm. Because an artificial epoch of 
length T + 1 is used to update the weights in the on-line 
version, the computational and storage requirements can be 
found by simply making the substitutions T = kj into the 
epochwise requirements. Thus, the computational complexity 
at each iteration of the on-line backsweep algorithm is 

Cob«3(T+1)Q, 

while the storage requirement is 

Sob^(T+1)(M + W)+2Q. 

C. Epochwise Recursive Algorithm 

The epochwise recursive algorithm is derived in full detail 
in Appendix B. As shown in this derivation, the algorithm 
is developed using the second chain rule expansion, (6). The 
algorithm is implemented by performing the following at each 
epoch: 

1 ) Calculations performed at each iteration of the epoch. 
The epochwise recursive algorithm is based upon recur- 
sively updating derivatives of the output and error. These 



updates are computed using the following sequence of 
calculations at each iteration of the epoch. 

a) Select the weight vector and iterate the dynamical 
system. At each iteration fc, the weight vector is 
selected as 

W fc (z) = W(i), 
and the dynamical system is iterated once using 
y fc = /(R*,Y 4l W fc (i))- 

The dynamical system is initialized by selecting 
values for Ro and Yo- 

b) Recursively calculate the output derivative. At 
each iteration k y the output derivative is calculated 
using 



5y, 



d+y k 

dw(i) dw k ( 



k OYk 



dy k d+y k -j 
dw(i) ' 



(19) 



If k < L, where L is the maximum delay in 
the discrete-time dynamical system, the initial 
conditions 



dw(i) ' * * ' dw(i) 



= 0 



are used in (19). For k - j > 0, the values of the 
terms d+y k -j/dw(i) should be stored in memory 
having been calculated using (19) at previous 
iterations. 

c) Recursively calculate the error derivative. As 
shown in Appendix B, the epochwise error 
derivative is 



d+E 
dw(i) 



i) A: 



k=0 



dE d+y k 
dy k dw{i) ' 



which may be recursively calculated using 
dE d+y k 



Sk = Sk-i + 



dy k dw{iY 



(20) 



This equation is initialized by S_i = 0. It is easy 
to show that the epochwise error derivative is 



d+E 
dw(i) 



= s k/ . 



Therefore, to calculate the error gradient at each 
iteration k, S k should be updated using (20). Of 
course, the term d + y k /dw(i) of the recursive 
calculation (20) is obtained from Step t-b). 

2) Update the weights. After the final iteration of the epoch, 
the weights are adapted using the error derivative S kf 
and the epochwise update rule of (9). 
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I) Epochwise Recursive Backpropagation: The epochwise 
recursive algorithm can be used to train any discrete-time 
domical system provided that the derivatives of > 
with respect to both the weights and recurrent inputs exist. 
Except for the calculation of the output derivative in Step 1- 
b) implementation of the algorithm is fairly simple; therefore, 
we shall only discuss calculation of the output denvauve. 

To update the output derivative at each iteration using 
(19), it \s necessary to compute the direct output denvauve 
dvJdw k (i) and the Jacobian matrix 8y k /dY k . Different 
Piques, depending upon the form of the dynamical sys- 
tem can be used to calculate these terms. As shown below, 
if the structure of the discrete-time dynamical system is a 
neural network, these two components can be found us- 
ing N backpropagations of N appropriately selected vectors 
n A Ajv). (JV is the number of outputs of the 
system in me standard representation.) The N vectors take the 
form 



. f 1 if n = j 
nj 1 0 otherwis 



otherwise 

where A nj is the jth element of the row vector A„ e R [lxN] - 
Each vector has only one nonzero component, which appears 
in the n = j column. Because y fc is the output vector of 
a feedforward neural network, the backpropagation of the 
vector A n through the neural network at iteration fc results 
in the calculation of the output derivative of the nth output, 
as indicated by 

dyt _ dy k {n) 
n dw k (i) dw k {i)' 

Furthermore, as shown by 
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these terms for all iterations of the epoch is (k f +l)NQ. Given 
the results of the backpropagations, the output derivatives 
for each weight are calculated as shown in (19). This 
calculation requires L matrix-vector multiplications and L 
vector-vector additions per iteration to compute the output 
derivatives for each weight. Therefore, the computational 
requirements of calculating all the output derivatives for an 
epoch is (fc/ + l)(N*LQ +NQ). The error derivative for each 
weight is computed using the recursive equation of Step 1- 
c) This calculation requires N multiplications and additions 
per iteration. Therefore, the computational requirements of 
implementing Step 1-c) at each iteration for all weights 
is (fc/ + l)NQ. The total computational requirements ot 
the epochwise recursive backpropagation algorithm for a 
dynamical neural network is 

C ER *(k f + l)(N*L + 2N + l)Q. 

The storage requirements are determined primarily by the 
need to store the L + 1 output derivative vectors for each 
weight (19). The total storage requirements of these output 
derivatives is(L+ 1)NQ. In addition to the output derivatives, 
20 memory locations are need to store the we.ghts and error 
derivatives. Hence, the storage requirement of the epochw.se 
recursive algorithm is 

S br k(L + 1)NQ + 2Q. 



dy k _ dy k (n) 
Xn dY k " dY k ' 



backpropagating the vector A„ to the input nodes results 
in the calculation of the nth row of the Jacobian matnx 
We can conclude that the direct output gradient dy k /dw k (i) 
and Jacobian matrix dy k /dY k can be computed by back- 
propagating [\ u A„, AW through the fcth iteration 
of L dynamical network. Using the epochw.se recurs.ve 
algorithm with the direct output gradient and Jacob.an matrix 
calculated in this manner is referred to as epochwise recurs.ve 
backpropagation. 

2) Complexity and Storage Requirements: The epocnw.se 
recursive backpropagation algorithm is based upon the 
following three operations being performed at each iteration, 
a forward propagation of the dynamical system, a recursive 
update of the output derivative and a recursive update of 
the error gradient. When a dynamical neural network with 
Q weights is trained, the computational complex.ty of the 
fc, + 1 forward propagations is (fc/ + l)Q- The computational 
complexity of calculating the output derivative is determ.ned 
by the number of backpropagations required for an epoch 
and the complexity of the recursive output derivat.ve update. 
The two terms dy k /dw k {i) and dy k /dY k must be computed 
using N backpropagations at each of the fc/ + l iterations of the 
epoch Therefore, the computational complex.ty of calculat.ng 



D. On-Line Recursive Algorithm 

The calculation of the output gradient at each iteration in 
the epochwise version of the recursive algorithm depends only 
upon the current and past values of the dynamical system. 
The lack of dependence on future values of the system in 
calculating the output gradient makes the algorithm attractive 
for on-line implementation. The on-line recursive .algorithm is 
implemented by performing the following attach .teration: 
1) Select the weight vector and iterate the dynamical sys- 
tem. At the current iteration fc', the weight vector .s 
selected as 



Wfc.(fc') = W(fc'), 
and the dynamical system is iterated once using 
y fc , = /(R fcS Y fc .,W fc .(fc'))- 

The dynamical system is initialized by selecting values 

for Ro and Yo. . . 

2) Recursively calculate an approximate output denvattve. 
The approximate output derivative is computed at each 
iteration using 

dZy~ k - dyj, Y^^jfS^L- (21) 

a^ifc 7 ) ~ dw k .(k>) + fr[ «y*'-i dw ( k ~ j) 

where 0.0 < a < 1.0. The precise output derivative 
cannot be calculated because of the weight changes 
at each iteration. The exponential weighting term a is- 
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TABLE I 

Computation and Storage Requirements. 



ZZ Epochwise algorithms Online algorithms 

Backpropagation through 

Requirements. Backpropagation through time Recursive backpropagation time Recursive backpropagation 

Computational 3(k f + l)Q (Jl7 + 1)(A' 2 I+3A T +1 )Q+Q (2T + 3)Q {N 2 L + N(L + 3) + 2)Q 

Storage 2Q + (k f + \)(M + N) {L+1)NQ 2Q + (T + l)(Af + N) (L+1)NQ 



introduced to reduce the effects of the less accurate terms 
of the summation [46]. Equation (21) is initialized using 



3+y-i 



d+y-L 



= 0. 



dw(-l) ' " * * ' dw(-L) 

The second term in the summation of (21) is either 
stored in memory having been calculated using this same 
equation at a previous iteration or determined by the 
initial conditions. 
3) Recursively calculate an approximate error derivative. 
Assuming the mean squared error is used as the er- 
ror criterion, the approximate error derivative can be 
calculated using 



TV = -(d*<-y*<r 



(22) 



dw k ,{k f ) JK/ dw k >(k') 

Once again, the error derivative cannot be precisely cal- 
culated because of the weight changes at each iteration. 

4) Update the weights. At the end of each iteration, the 
weights are adapted using the on-line update rule of (1 1). 

1 ) On-Line Recursive Backpropagation: The on-line recur- 
sive algorithm can be used to adapt the weights of neural 
network IIR filters and controllers. As discussed in the previ- 
ous subsection, in these cases the two terms dyv /dwk'(k') 
and dyv /dYk' can be calculated using N backpropagations 
at each iteration. This technique of combining the on-line 
recursive technique and the backpropagation algorithm is 
referred to as the on-line recursive backpropagation algorithm 
[15], [16], [29H32]. 

The computational requirement of the on-line recursive 
backpropagation algorithm follows almost immediately from 
those of the epochwise version. The only differences between 
the two versions of the recursive algorithm are the exponential 
weighting of the output derivative calculation (21) and the 
need for weight updates at each iteration in the on-line 
case. The exponential weighting adds an additional LNQ 
computations to the calculation of the output derivatives, 
therefore, using the results of the previous subsection, the 
output derivative computation requires (N 2 L + N(L + l))Q 
operations. At each iteration, the approximate error gradient 
calculation (22) requires NQ multiplications and additions 
while the weight update requires Q operations. We conclude 
that the computational complexity per iteration of the on-line 
recursive algorithm is 

Cor * (N 2 L + N{L + 2) + 2)Q. 

The exponential weighting and weight update of the on-line 
algorithm does not introduce any additional storage require- 
ments. Therefore, the storage requirement is identical to that 



of the epochwise version: 

Son « (L + l)NQ + 2Q. 

VI. Comparison of Algorithms 

The computational and storage requirements of the dynam- 
ical neural networks training algorithms presented in Section 
V are outlined in Table I. In this section, we use the results 
summarized in this table to compare the algorithms. It should 
be noted that although we are concerned only with the algo- 
rithms for dynamical neural networks in this section, many of 
the conclusions arrived at in this section apply equally well to 
the associated generalized algorithms. 

Starting with the epochwise algorithms, we find the back- 
propagation through time technique to be computationally 
superior to the recursive method. The inefficiency of the 
recursive method is a result of the matrix-vector multiplication 
in the output derivative calculation of (19). Although the 
recursive algorithm is inefficient, it has the advantage of 
fixing the storage requirements to (L + l)NQ + 2Q, which 
is independent of the number of iterations in an epoch. In 
some cases, where the total number of iterations kj + 1 in 
an epoch is large compared to the number of weights Q, 
the epochwise recursive algorithm may be advantageous to 
use for this reason. However, in most cases the total number 
of iterations is small compared to the number of weights, 
and the storage requirements favor the use of the epochwise 
backsweep algorithm. Because the backsweep algorithm is 
computationally more efficient and requires less storage in 
most cases, it should be used to train epochwise dynamical 
neural networks, including neural network controllers and 
filters. 

The ratio of the computational requirements of the two 
on-line algorithms, 

Cob „ 3(r+l)Q - 

C OR ~ (N*L + N(L + 2) + 2)Q' 

can be used to select the most efficient on-line method. 
Unlike in the epochwise case, neither algorithm is universally 
computational superior to the other. Instead, the selection of 
the most efficient on-line method depends on the number of 
outputs N, the maximum delay in the feedback loop L and 
the window length of the backsweep algorithm T. In addition 
to computational efficiency, the storage requirements should 
also be used to select the appropriate on-line algorithm. On the 
basis of the storage requirements alone, the on-line backsweep 
algorithm is preferable to the on-line recursive technique when 
(T+1)(AT+M) < (L+l)NQ. For almost all dynamical neural 
network systems, this inequality will hold and the storage 
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requirements will favor the use of the on-line backpropagation 
through time algorithm. 

The selection of an on-line algorithm for training a neural 
network IIR filter should be based on the number of outputs 
and the maximum delay in the feedback loop of the dynamical 
system. IIR filters with one output and a moderate number of 
feedback delays are commonly used in adaptive systems. For 
such filters, the computational requirements favor use of the 
on-line recursive algorithm while the storage requirements fa- 
vor use of the on-line backsweep algorithm. However, because 
the computational requirements are often more important than 
the storage requirements, the on-line recursive algorithm is 
recommended for adapting single output neural network filters. 
(It should be pointed out that on-line recursive techniques for 
adapting single output linear IIR filters have been well-studied 
[38], [39], [48] while we are unaware of any attempts to use 
the on-line backsweep algorithm for such filters.) While the 
recursive algorithm is preferable for training single output 
filters, the backsweep algorithm is better suited for training 
multidimensional filters. In the multidimensional case, the 
number of outputs is greater than one and the backsweep 
algorithm is therefore computationally superior. Summarizing 
the results above, the recursive algorithm is found to be 
superior for training single output IIR neural network filters, 
while the backpropagation through time algorithm is found to 
be better suited for adapting epochwise and multidimensional 
filters. 

The selection of an on-line algorithm for training neural 
network controllers is also based on the number of outputs 
and the maximum delay in the feedback loop of the dynamical 
system. For control systems in their standard representation, 
the number of outputs N must be greater than or equal to 
the number of states in the plant. Generally, the number 
of states is larger than one. The maximum delay in the 
feedback loop in most control systems is one. Under these 
circumstances, the computational requirements tend to support 
use of the backpropagation through time algorithm. Because 
the storage requirements will also generally favor use of the 
on-line backpropagation through time algorithm, this algorithm 
is preferable for on-line training of neural network controllers. 
Given these observations along with those above for the 
epochwise algorithms, one can conclude that the backpropa- 
gation through time algorithms are preferable for training both 
epochwise and on-line controllers. 



VII. Examples 

In this section, two examples that illustrate the uses of the 
dynamical system training algorithms are presented. The first 
example demonstrates the use of the algorithms for nonlinear 
controller design. In this example, a neural network is trained 
using the epochwise backsweep algorithm to provide the 
steering angle of a boat which is placed in a river with a 
nonlinear current. By providing the proper steering angle, the 
neural network guides the boat across the river to a designated 
dock position. The second example illustrates the use of the on- 
line recursive algorithm for adaptive filtering. In this example, 



an adaptive noise canceller is trained to eliminate noise from 
a corrupted signal. 

A. Nonlinear Control Example 

In this example, a boat is initially placed in a river, which 
is 200 ft wide, within a region 100 ft upstream or downstream 
of a dock. The boat is powered by a constant thrust motor 
which is also used to point the boat in any desired direction. 
Starting from any initial position, it is desired to maneuver the 
boat to a dock, which is located on one shore of the river. 
Maneuvering the boat to the dock is made difficult by the 
stream's nonlinear current. 

At iteration fc, let x k denote the distance from the center of 
the boat to the shore with the dock. Let y k denote the distance 
of the center of the boat upstream or downstream of the dock. 
Assuming the current to be a function only of the distance 
from the shore, x k , the equations of motion for the boat are 

= Xk + I0cos{u k ) (24) 
y*+i = Vk + lOsm(ujb) + f c (zk)- ( 25 > 

where u k% the orientation of the boat given in radians, is the 
control signal, and f c (xk)> the influence of the current on 
the boat, is given in feet per iteration. The current, which 
is parabolic in nature with the greatest force in the middle of 
the stream at x = 100, is given by the following equation: 

/c(x t ) = 7. 5 (^-(^) 2 ). 

The control signal is supplied by the output of a three layer 
neural network. The first layer contains the two inputs, x k 
and yjt, which are the states of the system. The hidden layer 
contains ten sigmoidal units which are fully connected to the 
inputs and a bias. The output layer, which is linear, is fully 
connected to the hidden layer and the bias. 

The boat system operates in an epochwise manner with the 
initial position determined randomly during training and the 
final position specified as the iteration prior to the boat hitting 
the dock's shore. For reasons discussed in Section VI, the 
epochwise backsweep algorithm was selected for training the 
neural network controller. The plant model was implemented 
by (24), (25). To drive the boat to the dock at the final iteration, 
the following error function was used: 

E = {x d -x kf f + {y d -y kf ) 2 , 

where x d is the x position of the dock, and y d is the y position 
of the dock. 

To train the neural network controller, 4000 training epochs 
were required (m = 0.0001). After training was completed, 
four demonstration epochs, which are shown in Fig. 4, were 
run. In the lower portion of Fig. 4, the current is shown as a 
function of x. (To show the boat graphically, it was necessary 
to move the two shores outward a distance equal to half the 
boat length. For this reason, the current near both shores is 
shown as zero.) The four demonstration epochs show that 
by using the backpropagation through time algorithm, it is 
possible to design a neural network controller for guiding the 
boat across the river. 
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Rg. 4. Nonlinear control example. 
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Fig. 5. Adaptive noise -cancelling system. 

5. Adaptive Filtering Example 

In this example, an adaptive noise cancelling system is 
used to reduce additive noise from a corrupted signal [38]. 
Fig. 5 shows an illustration of an adaptive noise cancelling 
system. The original signal s k is corrupted by a noise signal 
71*, which is a filtered version of the noise source r k . The noise 
source, which can be sensed directly, is used as input to an 
adaptive filter. By training this filter using an on-line error of 
Ek' - (5jt' 4- nfc' -yjt') 2 » the expected value of E[(n k -y*) 2 ] 
is locally minimized [38]. Therefore, if the adaptive filter can 
model the noise filter, the output of the adaptive filter y k will 
approximate n*, and the noise in the system output e* will be 
significantly attenuated. In order to allow proper training of 
the adaptive filter, it is required that the noise source signal 
r k and original signal s k be uncorrelated. 

If the noise filter is linear, a linear adaptive filter should be 
used to cancel the noise [38]. If the noise filter is nonlinear, 
a neural network filter can be used. In the example presented 
below, the noise filter was a nonlinear IIR filter, therefore, 
a neural network IIR filter was trained using the recursive 
algorithm to cancel the noise from the corrupted signal. 

In the noise cancelling example, the original signal was 




1000 2000 3000 4000 5000 6000 
Iteration 

Fig. 6. Learning curve of the noise cancelling system. 



The noise source signal r fc was a white uniformly distributed 
random variable with a range between -1.0 and 1.0. This noise 
source was filtered to produce n k using the following nonlinear 
difference equation: 



njfc = r k + f n (n k -\), 



(27) 



where 



fn(n k -i) - .5cxp 



/ -K-i-io) 2 \ 

V 0.67 J 



s k = .25cos(.4fc). 



(26) 



-• 5exp i — 0^7 — )' (28) 

It should be noted that the noise filter contained nonlinear 
feedback. 

The adaptive filter was implemented by a three-layer feed- 
forward neural network. The input layer was composed of two 
components, the noise source signal r k and the previous output 
of the adaptive filter y k -\. The hidden layer was composed 
of 17 sigmoidal units. The first five nodes were connected to 
the noise source signal r k while the remaining ten nodes were 
connected to the feedback signal y k -\. (Each of the hidden 
units were connect to a bias input.) The output layer contained 
one linear unit which was connected to the hidden nodes and 
the bias input. 

The filter was trained using the on-line recursive algorithm 
with a leam rate of /* = .005 and an exponential weighting 
factor of a — .95. A learning curve is shown in Fig. 6. The 
initial sharp decrease in the mean squared error over the first 
couple hundred iterations is due to relatively fast learning of 
the feedforward component of the filter. The slow decrease, 
which lasts for several thousand iterations, is due to learning 
the feedback component. The corrupted signal s k + n k and the 
original signal s k for iterations 5900-6000 are shown in Fig. 
7. Notice that it is impossible to determine the characteristics 
of the original signal from the corrupted signal. The output 
signal Cfc and original signal s k for these same iterations are 
shown in Fig. 8. Although the output signal is not perfect, the 
noise has been significantly reduced. 

VIII. Conclusion 

In this paper a unified framework for discrete-time dy- 
namical system training algorithms is developed. Using this 
framework, it is possible to select the appropriate training „ 
algorithm for designing neural network controllers and filters. 
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y* = /(R*,Y*,W*(t)) 



(32) 
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Iterations 



Fig. 7. Corrupted signal and original signal. 




W fc/ (i)«W(i) 03) 

yk f = f(Rk,,Y k/ ,w kf (i)) (34) 

E = /(y 0 , yi, . . . , y kf , do, d lt . . . , d kf ). (35) 

Using the first chain rule expansion (5) along with this set of 
ordered equations, the partial derivative d+E/dw(i) may be 
written as 



d+E 9W.(z) 
a\Vjb(i) dw( 



3E_ ( d+E dy k 

dw(%) ~ dw(i) \ dy k dw(i) 

k ~° (36) 
The term dE/dw(i) and the components of the vector 
dy k /dw{i) are equal to zero because E and y* are not 
direct functions of w(i). A Therefore, (36) can be written as 



5900 5910 5920 5930 5940 5950 5960 5970 5980 5990 6000 
Iterations 

Fig. 8. Output signal and original signal. 

As shown in this paper, the training of neural network 
controllers and filters can be accomplished using either the 
backpropagation through time or recursive backpropagation 
algorithms. The most important difference between the two 
techniques is the difference in computational and storage 
requirements. For the backpropagation through time algorithm, 
these requirements are independent of the number of outputs, 
while for the recursive algorithm, they increase quadraticaily 
with the number of outputs. Therefore, the backsweep algo- 
rithms should be used to train neural controllers because the 
number of outputs of a neural network control system in the 
standard representation is generally significantly larger than 
one (the number of outputs is greater than or equal to the 
number of states of the plant). For neural network IIR filters 
with a single output operating in an on-line mode, the system 
should be trained using the recursive algorithm, while for 
filters with multidimensional outputs, the system should be 
trained using backpropagation through time. 

Appendix A: Derivation of 
Epochwise Backsweep Algorithm 

Steps l)-)3 of the epochwise backsweep algorithm given in 
Section V-A are derived in full detail below. The standard 
representation "of Section II-A is used in this derivation. As 
shown below, the chain rule expansion of (5) is used repeatedly 
in deriving the algorithm. 

Our primary goal in deriving the algorithm is to find an 
expression, for the term d+E/dw(i), which is used to update 
the weights. We begin by noting that Step 1 of the algorithm 
produces the following ordered set of equations: 



d+E 
dw{i) 



3+je; 3w fc (o _ ^ d+E 



£aw*(i) dw(i) 



= 2J 



^dw k (iy 



(37) 



We need to find an expression for the term d+E/dw k (i). Once 
again, this expression can be found by expanding the ordered 
derivative using (5), 



d+E 



dE 



dw k {i) dw k {i) 



d+E dyj 



4-< dyj dw k (i) 

j— Ac 



d + E dV/j(i) 



i5i aw ' w dwk ® 



The term dE/dw k {i) and the components of the vector 
dWj(i)/dw k (i) are equal to zero. The term dyj/dw k (i) 
is nonzero only when k = j. Using these results, we find 
d + E/dw k {i) to be 

d+E d+E dy k 



dw k (i) dy k dw k {i)' 
Substituting (38) into (37), the error partial derivative is 



d+E 
dw(i) 



_^ d+E dy k 
^ dy k dw k (i) 



fc=0 



dy k 
dw k (i) ' 



(38) 



(39) 



where 



d+E 

dy k 



W 0 (i) = W(i) 

yo = /(Ro ! Y 0l W 0 (i)) 

W fc (i) = W(i) 



(29) 
(30) 

(31) 



is the Lagrange multiplier. Equation (39) gives us Step 3 of the 
algorithm. The term dy k /dw k (i) of (39) is easy to calculate. 
The term d+E/dy k can be found using the first chain rule 
expansion (5) once again 

aw,(i) dy k 

(40) 

4 In our definition of the ordinary partial derivative all terms are held 
constant except the terms in the denominator of the partial derivative. 
Therefore if the function which defines the numerator of the partial derivative 
does not contain the terms of the denominator directly, then the partial 
derivative is zero. 



_ d+E _ dE_, X f d+E dyj 
k "dy7~ dyk.^Kdyj dy k 
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AH components of the matrix dWj(i)/dy k are equal to zero. 
When j > k + L, where L is the maximum number of delays 
in the system, all components of the matrix dyj/dy k are also 
equal to zero. Using these results, (40) can be written as 



A - dE I V d+E **** 
" dYk j^dy k+j dy k 



dy k 



(41) 
(42) 



Equation (42) is a backward difference equation that can be 
solved using the following boundary conditions: 



dE 
dy k} 



(43) 
(44) 



Equation (42) gives us Step 2, completing the derivation of 
Steps 1-3 of the algorithm. As a point of interest, (42), (43) 
and (44) along with 



d+E 
dw(i) 



= 0 



are the discrete-time Euler-Lagrange equations for a dynamical 
system in the standard representation [7]. The epochwise 
backsweep algorithm is a numerical technique for solving 
these equations. 

Appendix B: Derivation of the 
Epochwise Recursive Algorithm 

The epochwise backsweep algorithm is derived using the 
first chain rule expansion, (5). In this section, the second chain 
rule expansion, (6), is used to derive the epochwise recursive 
algorithm. Our primary goal in deriving the algorithm is to 
find an expression for the term d+E/dw(i). We begin by 
noting that implementation of Step 1-a) of the algorithm 
along with the calculation of the error results in the set of 
ordered equations shown in Appendix A (29H35). Using the 
second chain rule expansion, (6), along with this set of ordered 
equations, the ordered partial derivative of the error may be 
written as 



d+E 



dE 



dw(i) dw(i) 



k=0 V 



BE d+y k dE d+W 



dy k dw{i) dW k (i) dw( 



(45) 

The term dE/dw{i) and the components of the vector 
dE/dW k (i) are equal to zero because the error E is not 
a direct function of w(i) and W fc (z). Therefore, (45) can be 
written as 



d+E _X dE d+y k 



In the algorithm, this equation is computed recursively using 
Step 1-c). The first term of (46), dE/dy ky is easy to compute. 



The second term d+y k /dw(i) can be found using the second 
chain rule expansion. 

d+y k ^ dy ± ^ dy k fl+W^t) dy k d+y, 
dw(i) dw(i) fr^dWjii) dw(i) ^^ Q d yj dw(i) 

(47) 

The components of the vector dy k /dw(i) are equal to zero. 
The term dy k /dWj(i) of the first summation is nonzero 
only when k = j, therefore, this summation only contains 
one nonzero term. As a result, the first summation can be 
written as dy k /dw k (i). In addition, the first term of the second 
summation dy k /dyj is nonzero only when k - j < L. Using 
these results, (47) can be written as 



d+y k 



dyi 



dw(i) dw k (i) j^dyu-i dw(z) 



dy k d+yk-. 



(48) 



This equation is used in Step 1-b) to recursively calculate the 
output gradients for the entire epoch. It is initialized using 

d+y-i d+y- L 



dw(i) dw{i) 



= 0. 



Equation (48) completes the derivation of the epochwise 
recursive algorithm. 
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